CollectiveX: experimental cross-vendor collective/EP benchmark#1896

Open
Oseltamivir wants to merge 123 commits into main from collectivex

Conversation

@Oseltamivir Oseltamivir commented Jun 23, 2026

Collaborator

Adds CollectiveX under experimental/CollectiveX/ — a cross-vendor collective / expert-parallel benchmark — plus an orchestration-only workflow.

What it adds

  • Per-SKU launch adapters (launchers/launch_<sku>.sh, the launch_${RUNNER_NAME%%_*}.sh convention) that run any benchmark via a CX_BENCH selector (nccl|deepep|all) through a shared launchers/run_in_container.sh.
  • Benchmarks: run_nccl.py (stock nccl-tests → parsed flat JSON), run_deepep.py (DeepEP dispatch/combine, normal mode), env_capture.py (Layer-0 provenance), plot.py. Every result is correctness-gated and carries a topology-aware comparison_key.
  • Single multi-arch, digest-pinned container for all NVIDIA SKUs (lmsysorg/sglang@sha256:4219…, amd64+arm64); DeepEP via rebuild-deepep. See CONTAINERS.md.
  • .github/workflows/collectivex-experimental.yml: push to collectivex (paths experimental/CollectiveX/**) → GB200 NCCL smoke; workflow_dispatch → chosen sku+benchmark (B200, DeepEP, larger sweeps). Logic stays under experimental/.
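
A minimal sketch of how the `launch_${RUNNER_NAME%%_*}.sh` convention resolves an adapter, written in Python for illustration. The runner names below are hypothetical; only the `%%_*` expansion (drop everything from the first underscore onward) and the launchers/ directory come from the description above.

```python
from pathlib import PurePosixPath

LAUNCHER_DIR = PurePosixPath("experimental/CollectiveX/launchers")

def launcher_for(runner_name: str) -> PurePosixPath:
    """Mirror bash's ${RUNNER_NAME%%_*}: strip the longest suffix matching '_*'."""
    sku = runner_name.split("_", 1)[0]
    return LAUNCHER_DIR / f"launch_{sku}.sh"

# Hypothetical runner names; the part before the first underscore picks the adapter.
print(launcher_for("gb200-nv_03"))
print(launcher_for("mi355x-amds"))
```

A runner name without an underscore maps to itself, which matches the bash expansion when the pattern does not match.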

Validated on hardware

  • NCCL primitives: B200 (8× NVLink island) + GB200 (4× NVL72 MNNVL), 4 ops, correctness-passed, topology-keyed distinctly.
  • DeepEP dispatch/combine on GB200: correctness-gated (token conservation + combine vs DeepEP's own reference), ~154 µs roundtrip, 1.66M tok/s.
  • Local: shellcheck/bash -n, py_compile, actionlint, parser fixtures.

Notes / deferred

  • Result JSONs are gitignored (captured env embeds hostnames/UUIDs); CI uploads them as workflow artifacts. Headline numbers are summarized in CONTAINERS.md.
  • Importing the exact multi-arch digest needs the runner's registry creds (validated on the pre-staged v0.5.11-cu130).
  • Precision axes (NVFP4/MXFP8/…), low-latency EP, MoRI, EPLB, multinode DeepEP, and other collectives are captured as roadmap in plan.md, not built.

Note

Low Risk
Changes are isolated to experimental/CollectiveX/ and a read-only workflow; no production benchmark matrix or serving launchers are modified. Risk is mainly operational (self-hosted GPU time, Slurm/enroot failures) rather than app or security impact.

Overview
Introduces CollectiveX under experimental/CollectiveX/ — an experimental cross-vendor collective and MoE EP benchmark — plus orchestration-only .github/workflows/collectivex-experimental.yml. Production serving paths are untouched.

Benchmark stack: run_nccl.py wraps nccl-tests/rccl-tests into provenance-tagged JSON; run_deepep.py and run_mori.py add correctness-gated DeepEP and AMD MoRI dispatch/combine; env_capture.py, summarize.py, and plot.py handle environment capture, CI summaries, and plots. Results use topology-aware comparison_keys so unlike fabrics are not merged blindly.

Execution: Per-SKU Slurm launchers (launch_b200-dgxc.sh, launch_gb200-nv.sh, launch_b200-dgxc-slurm.sh, launch_mi355x-amds.sh) follow the same launch_${RUNNER_NAME%%_*}.sh pattern as serving, with shared common.sh (enroot squash by tag, optional CX_STAGE_DIR rsync, in-container nccl/rccl builds). CX_BENCH selects nccl, deepep, mori, or all via run_in_container.sh.

CI: Push to collectivex runs MI355X MoRI on mi355x runners; workflow_dispatch picks SKU and benchmark (GB200/B200 NCCL, DeepEP, etc.), writes markdown to the job summary, and uploads gitignored results/*.json as artifacts.

Reviewed by Cursor Bugbot for commit 871086d.

Per-SKU launch adapters (launch_<sku>.sh) that run any benchmark via a CX_BENCH selector through a shared run_in_container.sh; multi-arch digest-pinned sglang container; NCCL-primitive + DeepEP dispatch/combine benchmarks with provenance + correctness gating; and an on:push workflow (GB200 NCCL smoke; workflow_dispatch for B200/DeepEP/larger sweeps).

Validated on hardware: NCCL primitives on B200 (8x NVLink) and GB200 (4x NVL72 MNNVL); DeepEP dispatch/combine on GB200 (correctness-gated).
Comment thread experimental/CollectiveX/launchers/run_in_container.sh Outdated
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/plot.py Fixed
Comment thread experimental/CollectiveX/run_deepep.py Fixed
The GB200 on:push smoke hung 25 min in enroot import: a bare digest ref (repo@sha256:) can't form an anonymous Docker Hub token scope, so enroot prompted for a password and blocked in non-interactive CI. Import by the multi-arch TAG instead (anonymous auth works, same as the serving launchers) and add </dev/null so a missing token fails fast rather than hanging.

Use v0.5.11-cu130 (multi-arch amd64+arm64, index sha256:061fb71f…): v0.5.12-cu130's 62 layers overflow enroot's overlay-based squash creation on these nodes (failed to mount overlay … Invalid argument). v0.5.11-cu130 imports cleanly and is pre-staged on GB200.
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_nccl.py Outdated
On the GB200 Actions path, CX_STAGE_DIR makes the launcher rsync the tree to compute-visible Lustre and the container writes results/ there; upload-artifact reads the checkout's results/ (empty), so the green smoke produced no artifact. Add cx_collect_results to copy result JSONs from the stage dir back to the checkout after the run (no-op when no staging was used).
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/launchers/launch_gb200-nv.sh Outdated
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
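
To illustrate finding #3, here is a sketch of why `world_size // 8` is a poor node count and what reading `SLURM_NNODES` looks like. The function name and the single-node fallback are assumptions for illustration, not the code in run_deepep.py.

```python
import os

def deepep_num_nodes(env=os.environ) -> int:
    """Prefer Slurm's own node count over a GPUs-per-node guess."""
    nnodes = env.get("SLURM_NNODES")
    if nnodes:
        return int(nnodes)
    # Assumption: outside a Slurm allocation, treat the run as single-node.
    return 1

# The old heuristic on a 4-GPU GB200 tray: 4 ranks // 8 gives zero nodes.
print(4 // 8)
print(deepep_num_nodes({"SLURM_NNODES": "1"}))
```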
Comment thread experimental/CollectiveX/run_deepep.py Outdated
is_token_in_rank=is_token_in_rank,
num_tokens_per_expert=num_tokens_per_expert,
)
combined_x, _, _ = buffer.combine(recv_x, handle, topk_weights=recv_topk_weights)

Dispatch dtype not applied

Medium Severity

The --dispatch-dtype / CX_DISPATCH_DTYPE value is stored in result metadata but never used when building inputs or calling buffer.dispatch. Runs always use bfloat16 token tensors regardless of fp8 vs bf16, so provenance and comparison keys can describe a different shape than what was measured.

Reviewed by Cursor Bugbot for commit b384171.

summarize.py --markdown emits GitHub-flavored markdown tables (NCCL + DeepEP); a per-job 'Results summary' workflow step appends it to $GITHUB_STEP_SUMMARY so the run page shows a rendered table (per the GitHub job-summaries feature). Plain-text mode still drives the in-container result gate.
--timestamp "$TS" || cx_log "WARN: parse $op failed"
done

cx_log "done — JSON artifacts under $CX_DIR/results/"

Multinode launcher ignores failures

High Severity

The B200 multinode adapter logs warnings when srun or run_nccl.py fail but always exits successfully. Unlike run_in_container.sh, it never runs summarize.py as a non-zero gate, so workflow_dispatch on b200-multinode can finish green with no valid NCCL results.

Reviewed by Cursor Bugbot for commit f48daed.

  run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
- name: Results summary
  if: always()
  run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"

Workflow skips result failure gate

Medium Severity

Both jobs only run summarize.py --markdown, which is documented to always exit 0. The workflow never runs the plain summarize.py gate on the checkout’s results/ after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).
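
A rough sketch of the kind of non-zero gate this comment asks for, run against the checkout's results/ directory. It assumes each result JSON carries a top-level `status` field equal to `valid` on success; the real summarize.py logic is not shown in this thread, so the function is illustrative only.

```python
import json
import sys
import tempfile
from pathlib import Path

def gate(results_dir):
    """Return 0 only when results exist and every one reports status == 'valid'."""
    files = sorted(Path(results_dir).glob("*.json"))
    if not files:
        print(f"gate: no result JSON under {results_dir}", file=sys.stderr)
        return 1
    bad = []
    for path in files:
        try:
            with path.open() as fh:
                data = json.load(fh)
        except (OSError, ValueError):
            data = None  # unreadable or malformed JSON counts as invalid
        if not isinstance(data, dict) or data.get("status") != "valid":
            bad.append(path.name)
    for name in bad:
        print(f"gate: invalid result {name}", file=sys.stderr)
    return 1 if bad else 0

with tempfile.TemporaryDirectory() as tmp:
    print(gate(tmp))  # empty directory, e.g. a failed copy-back
    Path(tmp, "nccl.json").write_text('{"status": "valid"}')
    print(gate(tmp))  # one valid result
```

Treating an empty directory as a failure is the point: a staged run whose copy-back failed would otherwise pass.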

Reviewed by Cursor Bugbot for commit f48daed.

dst="$repo_root/experimental/CollectiveX/results"
mkdir -p "$dst"
cp "$mount_src/experimental/CollectiveX/results/"*.json "$dst/" 2>/dev/null || true
cx_log "copied results from stage dir -> $dst (for artifact upload)"

Result copy errors ignored

Medium Severity

cx_collect_results wraps the staged-to-checkout cp in 2>/dev/null || true and always logs success, so a failed or empty copy does not affect the launcher exit code and the workflow can pass without uploadable JSON.
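
One way to make the copy-back observable, sketched in Python rather than the launcher's bash. The function name and directory layout are hypothetical; the idea is simply to count what was copied and fail when the count is zero instead of always logging success.

```python
import shutil
import tempfile
from pathlib import Path

def collect_results(stage_results: Path, dst: Path) -> int:
    """Copy result JSONs from the stage dir to the checkout; fail if none exist."""
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in sorted(stage_results.glob("*.json")):
        shutil.copy2(src, dst / src.name)
        copied += 1
    if copied == 0:
        raise FileNotFoundError(f"no result JSON found under {stage_results}")
    return copied

with tempfile.TemporaryDirectory() as tmp:
    stage = Path(tmp, "stage", "results")
    stage.mkdir(parents=True)
    (stage / "run.json").write_text("{}")
    print(collect_results(stage, Path(tmp, "checkout", "results")))
```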

Reviewed by Cursor Bugbot for commit f48daed.

First AMD / cross-vendor reach, scaffolded ahead of Milestone 1:

- run_mori.py: MoRI dispatch+combine (normal mode), correctness-gated,
  mirroring ROCm/mori's dispatch_combine example — int32 routing indices,
  (n,0) fp8 scales, the zero-copy registered-combine-input-buffer staging
  step, and expected = input x (#unique destination ranks). Emits the same
  flat JSON shape (family=moe, backend=mori) with CUDA-event timing.
- launchers/launch_mi355x-amds.sh: AMD adapter — partition compute, no
  account, --cpus-per-task=128, node-local /var/lib/squash imported via srun
  on the allocated node, --container-writable --container-remap-root, forces
  CX_BENCH=mori, mounts the (compute-visible) checkout at /ix.
- launchers/run_in_container.sh: run_mori_suite + mori case (nccl|deepep|mori|all).
- launchers/common.sh: ROCm MoRI image (rocm/sgl-dev:...-mori-0227-2) in
  cx_default_image for mi355x*/mi350x*/mi325x*/mi300x*.
- workflow: mi355x sku + mori benchmark options for workflow_dispatch.
- docs: CONTAINERS.md AMD section, README files/run/risks, plan.md status.

Not yet hardware-validated (no MI355X access) — MoRI's Python API is
version-sensitive (marked ADAPT HERE); the first runner job is the
validation, as GB200 was for DeepEP. The ROCm image isn't digest-pinned yet.
Comment thread experimental/CollectiveX/run_mori.py Fixed
- workflow: replace the on:push GB200 NCCL smoke with the MI355X MoRI
  dispatch/combine run (runs-on: mi355x, CX_BENCH=mori), and name the job
  "CollectiveX Experimental" (no longer "smoke"). GB200/B200 NCCL + DeepEP
  remain on workflow_dispatch.
- launch_mi355x-amds.sh: adapt more faithfully to runners/launch_mi355x-amds.sh
  — squeue by job-name only (no -u), flock -w 600, and clear ROCm gpucore.*
  dumps after the run so the next checkout is clean. Bump default CX_TIME to 60
  for a cold ROCm-image import.
- summarize.py: drop the "N/N results valid." footer from both the job-summary
  (markdown) and plain output; the failure gate still reports invalid results.
  Relabel the MoE section "MoE dispatch+combine (DeepEP / MoRI)".
- docs: README/plan describe push -> MI355X MoRI.
rm -f \"$SQUASH_FILE\"
enroot import -o \"$SQUASH_FILE\" \"docker://$IMAGE\" </dev/null
fi
"

MI355X import errors ignored

High Severity

The node-local enroot import runs inside an srun bash snippet without set -e and with no check after import. A failed import still yields exit 0 from that snippet, so the job continues into pyxis with a missing or corrupt squash file.

Reviewed by Cursor Bugbot for commit d8ee9bf.

- name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }}
  env:
    RUNNER_NAME: ${{ runner.name }}
  run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"

Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.

Reviewed by Cursor Bugbot for commit d8ee9bf.

… default)

First MI355X run reached the MoRI dispatch kernel — salloc, ROCm-image import,
mount, torchrun, 8-rank Gloo + shmem init, and EpDispatchCombineConfig/op/dispatch
all worked, confirming the API signatures. It OOM'd MoRI's default 2 GiB static
symmetric heap (hidden=7168 dispatch/combine buffers across 8 ranks request
~0.9 GiB each).

run_mori.py now sets MORI_SHMEM_HEAP_SIZE before `import mori` (default 16 GiB,
override CX_MORI_HEAP_BYTES). Docstring + CONTAINERS.md record the finding;
correctness/timing validated by the heap-sized re-run.

salloc --partition="$PARTITION" --exclude="$EXCLUDE_NODES" --gres=gpu:"$NGPUS" \
--exclusive --cpus-per-task=128 --time="$TIME_MIN" --no-shell --job-name="$RUNNER_NAME"
JOB_ID="$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)"

Slurm job ID not scoped

Medium Severity

launch_mi355x-amds.sh resolves JOB_ID with squeue --name="$RUNNER_NAME" and no -u "$USER", while the other CollectiveX NVIDIA launchers filter by user. On a shared cluster, the first matching job name may belong to another account, so subsequent srun/scancel can target the wrong allocation.
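
For reference, the user-scoped query this comment describes could be assembled as below. This is a sketch that only builds the argument list (it does not call Slurm); the helper name and the example job and user names are hypothetical, while `--name`, `--user`, `-h`, and `-o %A` are standard squeue options.

```python
def squeue_job_id_argv(job_name: str, user: str) -> list[str]:
    """Build a squeue query scoped to both the job name and the submitting user."""
    return ["squeue", f"--name={job_name}", f"--user={user}", "-h", "-o", "%A"]

# Hypothetical values; a launcher would pass $RUNNER_NAME and $USER here.
print(" ".join(squeue_job_id_argv("mi355x-amds_0", "ci-user")))
```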

Reviewed by Cursor Bugbot for commit ac3f1b9.

The heap-bump run cleared the 2 GiB OOM but then failed registering the 16 GiB
symmetric heap as an RDMA memory region (errno 22 EINVAL, size=17179869184).
ROCm/mori's reference test uses MORI_SHMEM_HEAP_SIZE="6G" single-node — big
enough for the hidden=7168 dispatch/combine buffers, small enough to register.

Match it: default "6G" (override CX_MORI_HEAP_SIZE). The rest of the config
already matches the reference (max_num_inp_token_per_rank=4096, hidden=7168,
backend cpu:gloo,cuda:nccl), so this lands on the proven single-node setup.
Drove run_mori.py to a correct run on 8x MI355X (on-node via salloc+srun):
dispatch+combine numerically correct (combine within tol, max_rel ~2e-3),
~85us round-trip at the decode shape. The first runs surfaced five issues,
all fixed and re-validated:

- RDMA MR ceiling: MoRI registers the WHOLE symmetric heap as one RDMA MR at
  init (even single-node; no disable-RDMA knob). The ionic_rdma NICs cap GPU
  MRs at ~4 GiB — a 6 GiB heap fails (RegisterRdmaMemoryRegion errno 22), 2 GiB
  registers. Hold heap at MORI_SHMEM_HEAP_SIZE=2G (override CX_MORI_HEAP_SIZE).
- Buffer sizing: max_num_inp_token_per_rank 4096 -> max(512, n) so the buffers
  fit the 2 GiB heap (4096 was inherited from the reference test).
- Correctness shape: combine returns the full max-token buffer; compare only
  combined[:n] against expected.
- recv count: read total_recv BEFORE combine (combine resets recv_num, which
  made recv_nonzero a false negative).
- Teardown: MoRI's shmem teardown asserts (CheckStatusValid -> SIGABRT) when the
  op is destroyed after shmem_finalize(); hard-exit after writing results.

Docs (README/plan/CONTAINERS) updated from "scaffolded" to validated, with the
fabric constraints recorded.
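
The heap-size handling described above can be sketched as follows. The `parse_size` helper and the hard 4 GiB check are assumptions added for illustration (the commit only reports that 2 GiB registers and 6 GiB fails on these NICs); the environment variable names and the "2G" default come from the message.

```python
import os

GIB = 1 << 30
# Assumption: treat ~4 GiB as the ceiling; the exact boundary was not measured.
MR_CEILING_BYTES = 4 * GIB
_SUFFIX = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(text: str) -> int:
    """Parse '2G'-style sizes or plain byte counts (hypothetical helper)."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIX:
        return int(text[:-1]) * _SUFFIX[text[-1]]
    return int(text)

def configure_heap(env=os.environ) -> str:
    """Set MORI_SHMEM_HEAP_SIZE; must run before `import mori`."""
    size = env.get("CX_MORI_HEAP_SIZE", "2G")
    if parse_size(size) > MR_CEILING_BYTES:
        raise ValueError(f"heap {size} exceeds the ~4 GiB RDMA MR ceiling")
    env["MORI_SHMEM_HEAP_SIZE"] = size
    return size

env = {}
print(configure_heap(env), env)
```

Failing early in Python gives a clearer message than the `RegisterRdmaMemoryRegion errno 22` seen at MoRI init.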
Comment thread experimental/CollectiveX/run_mori.py Fixed
Comment thread experimental/CollectiveX/run_mori.py Fixed
…CH=nccl)

Adds the AMD collective-primitive path so all_reduce/reduce_scatter/all_gather/
alltoall run on MI355X, not just MoRI:

- common.sh: cx_build_rccl_tests — clones ROCm/rccl-tests and builds with `make`
  against /opt/rocm (amdclang++/librccl). It's a nccl-tests fork producing the
  same <op>_perf binaries and output format, so run_nccl.py parses it unchanged.
  Validated building + running all 4 ops in-container on MI355X (correctness OK).
- run_in_container.sh: run_nccl_suite picks rccl-tests on ROCm (/opt/rocm or
  hipcc), nccl-tests otherwise; identical op loop + run_nccl.py invocation.
- launch_mi355x-amds.sh: honor CX_BENCH (mori default | nccl) instead of forcing
  mori; same -g N single-node 8-GPU launch.
- docs: README/CONTAINERS note the rccl path.

B200 already has the nccl path; this makes primitives available on all three
SKUs via workflow_dispatch.
Comment thread experimental/CollectiveX/launchers/launch_mi355x-amds.sh
if name:
devices.append(name)
elif _run(["ibstat", "-l"]):
devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]

ibstat fallback may crash capture

Low Severity

In _rdma, the ibstat -l branch calls _run twice. If the first call succeeds but the second returns None, None.splitlines() raises and env_capture.py aborts before writing provenance JSON for that run.
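
A possible fix is to call the helper once and reuse its result. The sketch below uses a stand-in `_run` (the real helper in env_capture.py is not shown here) and a hypothetical `ibstat_devices` wrapper that accepts the runner as a parameter so the behavior can be exercised without ibstat installed.

```python
import subprocess
from typing import Callable, Optional

def _run(cmd: list[str]) -> Optional[str]:
    """Stand-in helper: stdout on success, None on any failure."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    return out.stdout if out.returncode == 0 else None

def ibstat_devices(run: Callable[[list[str]], Optional[str]] = _run) -> list[str]:
    listing = run(["ibstat", "-l"])  # single call; no second chance to return None
    if not listing:
        return []
    return [d.strip() for d in listing.splitlines() if d.strip()]

# A flaky runner that succeeds once and then fails no longer crashes the capture.
answers = iter(["mlx5_0\nmlx5_1\n", None])
print(ibstat_devices(lambda cmd: next(answers)))
```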

Reviewed by Cursor Bugbot for commit 2b23573.

…on-node

launch_gb200-nv.sh now branches on CX_NODES: 1 (default) keeps the single-tray
4-GPU dispatcher path; >1 runs across the NVL72 NVLink fabric (e.g. CX_NODES=2
= 8 GPU) by building nccl-tests MPI=1, running each op across WORLD ranks via
`srun --mpi=pmix` (1 GPU/rank) with the MNNVL env, and parsing on the login node
— mirroring launch_b200-dgxc-slurm but staying on NVLink instead of IB.

Validated on GB200 (2x watchtower-navy trays, 8 GPU): all 4 ops valid, peak
busbw all_reduce 822.8 / reduce_scatter 670.6 / all_gather 651.2 / alltoall
625.0 GB/s — ~30% over single-tray and on par with B200 8-GPU NVLink, i.e.
MNNVL engaged (not an IB fallback).

- common.sh: cx_build_nccl_tests auto-detects MPI_HOME for MPI=1 (Debian OpenMPI
  headers live under /usr/lib/<arch>/openmpi/include; MPI_HOME=/usr fails). Works
  x86_64 + aarch64.
- launch_b200-dgxc-slurm.sh: fix BUILD_IN_CTR path (.nccl-tests/nccl-tests/build).
- workflow: add `nodes` dispatch input -> CX_NODES.
…ective tab

RL-mesh tab: bandwidth + latency vs transfer size, one line per (sku, direction, pattern)
— trainer<->generator, paired (1:1 send/recv) vs redistribute (disjoint all-to-all). B300
682-704 GB/s, H100/H200 ~370. All 9 frontend tabs now populated with real GHA data.
Cleanup (goal): launchers/ now holds ONLY launch_<sku>.sh. Moved the in-container runtime
(common.sh, run_in_container.sh) -> runtime/ and the dev/orchestration helpers (_gha_*,
_keep_newest.py, _v3/_v4/_mi355x/_gb300/_routing/_repro/_sensitivity/_validate*) -> tools/.
Updated launch scripts' source + run_in_container paths ($HERE/../runtime/common.sh;
runtime/run_in_container.sh), run_in_container's CWD-relative common.sh source, + README/
CONTAINERS refs. No functional change to the launch path — validated below via GHA.
if "deepep" not in b and "mori" not in b and not b.startswith("failed_"):
continue
try:
d = json.load(open(f))
… + GHA)

- ep_flashinfer.py (FlashInferBackend, flashinfer.comm.MoeAlltoAll + trtllm variant) wired:
  run_ep.py --backend flashinfer, capability CAP[flashinfer], schema enum, run_flashinfer_suite
  (CX_BENCH=flashinfer), workflow option. Tests the MNNVL-on-x86_64 question empirically.
- allreduce_fw_bench.py (family=allreduce-fw): nccl baseline + flashinfer one-shot/two-shot +
  sglang/vllm custom all-reduce, import-guarded; CX_BENCH=allreduce-fw torchrun path, capability
  passthrough, collector + summarize family, workflow option.
DeepEP fp8 dispatch gains scale-layout recipe variants (same kernel, different cast):
fp8=per-block-128 (default), fp8-pertoken (one scale/token), fp8-directcast (unscaled — zero
scale-transport, the recipe MoRI PR311 replaced). Selected via dispatch_dtype; recorded in
backend_provenance.scale_layout so each is a distinct operating point. Closes goal P1
'Direct-cast FP8 dispatch' + 'Per-token/Per-block scale layout variants' + informs 'Scale
transport overhead'. capability/schema/workflow dtype enums extended.
…e backend tracebacks

- ep_harness --dispatch-dtype choices += fp8-pertoken/fp8-directcast (argparse rejected them
  before the adapter ran — the 6 fp8-recipe runs failed here, not in the cast code).
- allreduce_fw_bench status=valid iff nccl baseline produced bw>0 (framework custom kernels
  not in every image are recorded in frameworks_available, not a failure).
- run_ep.py prints any backend exception's full traceback to STDOUT (torchrun summarizes stderr
  only, hiding new-adapter errors like flashinfer's in CI).
The MoeAlltoAll Mapping requires world_size==tp*pp*cp and realizes MoE-EP as a view over the
TP dim, so pure EP across all ranks = tp_size=world_size, moe_ep_size=world_size (was tp_size=1
-> 'world_size must equal tp*pp*cp, 8!=1*1*1'). Confirms the failure was a Mapping-arg bug, NOT
the MNNVL/pidfd hardware blocker — re-smoking to see if FlashInfer EP runs on x86_64.
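
The constraint behind that error can be shown with a small stand-alone check. This is not FlashInfer's Mapping class, only a hypothetical helper that reproduces the `world_size == tp*pp*cp` rule quoted above.

```python
def check_parallel_sizes(world_size, tp_size, pp_size=1, cp_size=1):
    """Raise if the parallel sizes do not multiply out to world_size."""
    if world_size != tp_size * pp_size * cp_size:
        raise ValueError(
            f"world_size must equal tp*pp*cp, {world_size}!={tp_size}*{pp_size}*{cp_size}"
        )
    return True

try:
    check_parallel_sizes(world_size=8, tp_size=1)  # the original arguments
except ValueError as err:
    print(err)
print(check_parallel_sizes(world_size=8, tp_size=8))  # pure EP across all ranks
```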
… worked)

Mapping now constructs (tp_size=world_size); next: MoeAlltoAll asserts hidden_size required.
Pass hidden_size=args.hidden. Confirms steady progress past construction — not a hardware block.
…l unblock

The MoeAlltoAll workspace bootstrapped its cross-rank comm via MPI (MpiComm().Split), failing
under torchrun (no mpi4py/MPI launch) — that was the actual blocker, NOT CAP_SYS_PTRACE. FlashInfer
ships TorchDistBackend + MnnvlConfig; register it via MnnvlMemory.set_comm_from_config(mapping,
config) before constructing MoeAlltoAll so the symmetric workspace uses the torch.distributed NCCL
group torchrun already set up. Pass mnnvl_config to MoeAlltoAll too.
…pace=False

dispatch + MNNVL workspace now WORK (torch-dist fix proved it's not hardware-gated). combine
failed: payload_in_workspace=True demands the payload at the exact workspace pointer (RuntimeError,
not TypeError, so _call_variants didn't fall through). Clone the recv to a fresh tensor + pass
payload_in_workspace=False so the kernel copies our identity-expert output in itself.
MoeAlltoAll is idle->dispatched->idle and asserts 'dispatch called twice without combine'. The
harness times dispatch in isolation (loops it). Fix: combine_needs_redispatch=True (combine-timing
runs untimed dispatch+stage via pre= before each combine sample) + _reset_moe_fsm() at dispatch
start (so the looped dispatch timing + paired roundtrip stay legal). dispatch+workspace already
proven working on x86_64 via torch-dist — this is the last structural gate before correctness.
… FSM)

Looping isolated dispatch corrupts MoeAlltoAll's symmetric workspace (CUDA launch failure) — the
FSM requires PAIRED dispatch+combine. Add a roundtrip_only backend flag: run_sweep then times ONLY
the paired roundtrip (dispatch->stage->combine, each iter cycling the FSM idle->dispatched->idle
cleanly) and mirrors it into dispatch/combine for schema/plot. FlashInfer EP set roundtrip_only=True.
The roundtrip is goal P0's headline metric, so this is the correct measurement here. (Proven: FlashInfer
EP RUNS on x86_64 via torch-dist MNNVL — not the hardware block the early probe assumed.)
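
A toy model of the idle -> dispatched -> idle state machine makes the timing constraint concrete. The class below is purely illustrative (it is not MoeAlltoAll, and moves no data); it only shows why looping dispatch alone is illegal while the paired roundtrip is not.

```python
import time

class PairedA2A:
    """Toy stand-in for a dispatch/combine op with an idle/dispatched FSM."""
    def __init__(self):
        self.state = "idle"

    def dispatch(self):
        if self.state != "idle":
            raise RuntimeError("dispatch called twice without combine")
        self.state = "dispatched"

    def combine(self):
        if self.state != "dispatched":
            raise RuntimeError("combine called without dispatch")
        self.state = "idle"

def time_roundtrip(op, iters):
    """Time only the paired roundtrip so every iteration ends back in idle."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        op.dispatch()
        op.combine()
        samples.append(time.perf_counter() - t0)
    return samples

print(len(time_roundtrip(PairedA2A(), 5)))
```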
…ame [ep,maxT,H] shape)

FlashInfer source: dispatch->recv [ep_size,max_tokens,hidden]; combine wants payload of the SAME
shape. recv[0] IS the identity-expert output -> hand it straight to combine, no clone (the clone
broke the workspace-view layout -> async CUDA corruption). payload_in_workspace=False so the kernel
stages it. + a one-time rank0 shape log. (FlashInfer EP construct+dispatch+MNNVL-workspace proven
working in the GHA cap-enabled container; this is the combine-contract fix.)
FlashInfer EP RUNS on H100 (8 pts, disp_p99=120.7us, conformant, anomaly-free) — only
semantic_correctness failed. combine takes no gate weights + reduces recv [ep_size,max_tokens,H]
over the per-RANK axis, so identity round-trip = x * distinct_ranks_per_token (like DeepEP normal),
not x*topk. Compute it vectorized from the routing trace; default CX_FLASHINFER_ROUTING_FACTOR=ranks.
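
As a plain-Python illustration of the `ranks` factor (the harness computes it vectorized), the sketch below counts distinct destination ranks per token. It assumes experts are laid out contiguously across ranks (`rank = expert // experts_per_rank`) and that negative indices mark unused slots; both are assumptions, not confirmed by the commit.

```python
def distinct_ranks_per_token(topk_idx, experts_per_rank):
    """Number of distinct destination ranks each token is routed to."""
    return [len({e // experts_per_rank for e in row if e >= 0}) for row in topk_idx]

def expected_roundtrip(x, topk_idx, experts_per_rank):
    """Identity-expert expectation: each token scaled by its distinct-rank count."""
    factors = distinct_ranks_per_token(topk_idx, experts_per_rank)
    return [[v * f for v in row] for row, f in zip(x, factors)]

topk_idx = [[0, 1, 4, 5], [0, 8, 16, 24]]  # two tokens, topk=4, 4 experts per rank
x = [[1.0, 2.0], [1.0, 2.0]]
print(distinct_ranks_per_token(topk_idx, 4))
print(expected_roundtrip(x, topk_idx, 4))
```

The first token hits two ranks even though topk is 4, which is why `x*topk` was the wrong expectation.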
…import it

cx_build_uccl clones uccl + copies ep/deep_ep_wrapper/deep_ep -> a uccl_deepep package (relative
imports, no deep_ep shadow). ep_uccl.py imports uccl_deepep.Buffer (group ctor, matching its
Buffer(self.group,...) calls + DeepEP's API) with low-level uccl.ep fallback. The earlier UCCL run
failed because uccl.ep.Buffer is (rank,num_ranks) not (group) — the wrapper is the DeepEP-compatible layer.
…v0.1.1)

Wrapper construct worked (group Buffer), but its dispatch hit get_rdma_buffer signature mismatch:
the main-branch wrapper vs the pip wheel 0.1.1. Clone uccl at the tag matching the installed
version so wrapper + uccl.ep C-extension agree.
…already written)

UCCL EP RUNS + PASSES (H100 smoke: status=valid, correct=True x8, disp_p50=146us, comb_p50=105us)
but SIGSEGVs in symmetric-memory teardown after the JSON is written. Like ep_mori, os._exit(rc) past
the crashy cleanup so a valid result isn't marked failed. UCCL EP now a working backend (v0.1.1
wrapper + cu12 shim).
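
The hard-exit pattern can be demonstrated without any GPU library. In this sketch a child process registers an `atexit` handler that aborts (a stand-in for the crashing teardown), writes its result, and then calls `os._exit`, which skips the handler; the file names and JSON contents are made up for the example.

```python
import json
import os
import subprocess
import sys
import tempfile

CHILD = r"""
import atexit, json, os, sys
atexit.register(os.abort)                 # stand-in for a teardown that crashes
with open(sys.argv[1], "w") as fh:
    json.dump({"status": "valid"}, fh)    # closing the file flushes it to disk
os._exit(0)                               # skips atexit handlers and other cleanup
"""

def run_child():
    """Run the child and return its exit code plus the result it wrote."""
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "result.json")
        proc = subprocess.run([sys.executable, "-c", CHILD, out])
        with open(out) as fh:
            return proc.returncode, json.load(fh)

print(run_child())
```

Because `os._exit` does not flush buffered streams, the result file has to be closed (or explicitly flushed) before the call.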
…+ nvfp4)

Wire quantized dispatch into the FlashInfer EP adapter (MoeAlltoAll, the TRT-LLM
throughput-backend one-sided A2A). The A2A is a dtype-agnostic byte-mover taking
input_payloads as a list, so a quantized dispatch moves [q, scale_factor] and dequants
in stage(). Adds:
 - fp8/fp8-pertoken/fp8-directcast (e4m3, same convention as ep_deepep)
 - mxfp8 (e4m3 + e8m0 block-32, device dequant verified == flashinfer host dequant)
 - nvfp4 (e2m1 + e4m3 block-16, fp4_quantize + e2m1_and_ufp8sf_scale_to_float)
Dequant is cached on the problem (deterministic recv) so the roundtrip measures comm
only, consistent with DeepEP's untimed-stage timing boundary. mxfp4 excluded (its SF is
tile-padded, not per-token-movable through the A2A). Validated end-to-end EP2 on B300:
all correct=True, comm-only latency. Records trtllm lineage in provenance (MoeAlltoAll
lives in flashinfer.comm.trtllm_moe_alltoall) + scale_layout/quant_kind.
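
A back-of-the-envelope view of what each recipe moves per token, derived only from the element and scale widths listed above. It assumes hidden=7168 (the size used elsewhere in this PR) and ignores alignment padding and any extra global scale, so the numbers are illustrative rather than measured.

```python
def dispatch_bytes_per_token(hidden, elem_bits, scale_bits=0, block=None):
    """Payload bytes plus per-block scale bytes for one token."""
    payload = hidden * elem_bits // 8
    scales = 0 if block is None else (hidden // block) * scale_bits // 8
    return payload + scales

H = 7168
print("bf16 ", dispatch_bytes_per_token(H, 16))                          # 14336
print("mxfp8", dispatch_bytes_per_token(H, 8, scale_bits=8, block=32))   # 7168 + 224
print("nvfp4", dispatch_bytes_per_token(H, 4, scale_bits=8, block=16))   # 3584 + 448
```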
…reduce_fusion)

Replace the guessed flashinfer custom-AR entrypoints with the pinned real contract
(flashinfer 0.6.8.post1): trtllm_allreduce_fusion with pattern_code=AllReduceFusionPattern.
kAllReduce (pure AR, no fusion) + use_oneshot True/False = one-shot vs two-shot, over the IPC
workspace from trtllm_create_ipc_workspace_for_all_reduce_fusion. Adds a _SkipSize path so sizes
a kernel can't shape (sub-hidden; two-shot needs token_num>tp_size) are recorded as skipped rows
without failing the impl. Validated EP2 on B300: nccl 56.5, flashinfer-oneshot 59.9 (beats nccl
in the latency regime), flashinfer-twoshot 36.3 GB/s, all correct. Covers goal's FlashInfer /
one-shot / two-shot all-reduce.
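
The skip path can be pictured as an exception that the sweep turns into a row instead of a failure. Only the `_SkipSize` name and the `token_num>tp_size` condition come from the commit; the kernel stub, the row fields, and the `sweep` function are hypothetical.

```python
class _SkipSize(Exception):
    """Raised when an implementation cannot shape a given size."""

def two_shot_stub(token_num, tp_size):
    """Placeholder for a two-shot all-reduce call; returns a dummy bandwidth."""
    if token_num <= tp_size:
        raise _SkipSize(f"two-shot needs token_num>tp_size ({token_num}<={tp_size})")
    return 1.0

def sweep(impl, token_counts, tp_size):
    rows = []
    for n in token_counts:
        try:
            rows.append({"token_num": n, "status": "valid", "busbw": impl(n, tp_size)})
        except _SkipSize as why:
            # Recorded as a skipped row; the implementation as a whole stays valid.
            rows.append({"token_num": n, "status": "skipped", "reason": str(why)})
    return rows

for row in sweep(two_shot_stub, [1, 2, 4], tp_size=2):
    print(row)
```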
nvfp4 (e2m1 FP4) is a Blackwell-native format — FlashInfer's fp4 quantize/dequantize does not
round-trip on Hopper sm90 (validated: correct=True on B300, correct=False on H100). Add
ARCH_ONLY_DTYPES + _sku_arch() so capability.resolve rejects nvfp4 on Hopper cleanly (mxfp8/e4m3
unaffected). Refresh docs/gated.md: FlashInfer EP is DONE on H100+B300 (the prior x86_64-blocked
framing was wrong — only the H200 runner denies CAP_SYS_PTRACE); MXFP8+NVFP4 dispatch done (mxfp4
gated on its tile-padded SF); quant-combine OUTPUT gated (no output_dtype in flashinfer 0.6.8.post1);
FlashInfer one-shot/two-shot all-reduce done.
…d.md refresh

plot_ep.py: add load_allreduce_fw_series() — render family=allreduce-fw (flashinfer one-shot/
two-shot vs nccl) in the existing All-reduce tab (op=all_reduce, one line per impl, per-SKU-family
colors). Purely additive; no JS change. docs/gated.md already refreshed in the prior commit covers
FlashInfer EP done on H100+B300, mxfp8/nvfp4 dispatch, quant-combine-output gated, framework AR.
…e / all-gather / framework AR)

Add a methodology.md section mapping the non-EP collective families to their serving patterns:
TP-activation all-reduce (NCCL ring vs FlashInfer one-shot/two-shot crossover), and the DP-attention
-> TP-MoE all-gather handoff (the standardized byte sweep spans the [total_tokens,hidden] handoff
payload sizes). Documents the SGLang DP-attention use case honestly (size coverage present; named
per-model shapes = further lift).
Add deepep-hybrid as a 5th NVIDIA EP backend. The hybrid-ep branch (NVIDIA's TMA + warp-pipeline
impl, deep_ep.HybridEPBuffer) is built from source by cx_build_deepep_hybrid with 3 container fixes
(CUDA-13 cccl include via CPATH; unversioned libnvshmem_host.so symlink; NVSHMEM_DIR). ep_deepep_
hybrid.py wraps HybridEPBuffer (distinct API: construct with hidden/max_tokens/local_experts;
dispatch(hidden,topk_idx=,topk_weights=,num_of_experts=) -> (recv,_,_,handle); combine(recv,handle=)).
Intranode NVLink path (<=8 ranks, one NVLink domain). Identity-expert combine uses the per-rank-sum
'ranks' factor (verified: 8-rank topk=8 relerr 4.28 == E[distinct ranks] 5.26). Validated end-to-end
EP2 + EP8 on B300: correct=True, status=valid. Wired capability (Blackwell+Hopper, bf16 normal),
run_ep --backend, run_in_container CX_BENCH case, schema enum, workflow option. fp8 (use_fp8) +
internode NVLink<->RDMA forwarding are further lift (multi-node).
…cache/rl-mesh/allreduce-fw)

launch_mi355x-amds.sh previously forced CX_BENCH=mori for anything except mori/nccl, blocking the
AMD-capable collective families. Extend the allow-list to mori|nccl|kv-cache|rl-mesh|allreduce-fw
(all run on ROCm/CDNA4: HIP memcpy for kv-cache, torch.distributed->RCCL for rl-mesh/allreduce-fw;
allreduce-fw's flashinfer one/two-shot self-skip on the ROCm image, leaving a valid RCCL baseline).
NVIDIA-only benches still fall back to mori (and capability rejects them on amd anyway).
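
The allow-list behavior amounts to a lookup with a fallback. The sketch below restates it in Python; the function name is hypothetical and the launcher itself is a bash script.

```python
AMD_BENCHES = ("mori", "nccl", "kv-cache", "rl-mesh", "allreduce-fw")

def select_bench(requested: str) -> str:
    """Honor AMD-capable benchmarks; anything else falls back to mori."""
    return requested if requested in AMD_BENCHES else "mori"

print(select_bench("rl-mesh"))
print(select_bench("deepep"))  # NVIDIA-only, so the launcher runs mori instead
```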
…a2a_combine

Confirmed from the flashinfer main-branch source that moe_a2a_combine/MoeAlltoAll.combine gained
output_dtype + output_scales + output_scalar_scale (bf16/fp8_e4m3fn/packed-fp4) in a newer release
(PR3376/3643) — absent in the bundled 0.6.8.post1. Wire a quantized COMBINE OUTPUT path (fp8 e4m3):
 - run_in_container cx_build_flashinfer_latest: pip -U flashinfer when CX_COMBINE_DTYPE!=bf16 (asserts
   the upgraded combine has output_dtype); dispatch path stays on the bundled version.
 - ep_flashinfer combine(): single-shot quant-combine (output_dtype=fp8 + output_scales[T,1] per-token,
   CX_QC_SCALE override), dequant cached/untimed for correctness (the fp8 reduction is timed).
 - capability flashinfer combine_dtypes={bf16,fp8}; harness/run_ep gate already reads SUPPORTED_COMBINE_*.
 - workflow: input-cap-safe 'flashinfer-combine-fp8' benchmark choice -> CX_BENCH=flashinfer +
   CX_COMBINE_DTYPE=fp8 (capability-validate mapped). nvfp4/mxfp8 combine reserved until fp8 validates.
…ersions

The first combine-fp8 GHA run upgraded flashinfer-python 0.6.8.post1 -> 0.6.13 but left
flashinfer-cubin at 0.6.8.post1, so import flashinfer raised a version-mismatch (the combine
output_dtype IS in 0.6.13). Upgrade flashinfer-python + flashinfer-cubin + flashinfer-jit-cache
together and export FLASHINFER_DISABLE_VERSION_CHECK=1 as a fallback.
…ter)

MoRI runs on the shared MI355X timed out at the default 900s per-phase guard (cold enroot import +
node contention + MoRI's slowness at larger T). Raise CX_RUN_TIMEOUT to 1800s in launch_mi355x-amds.sh
(fits the 60-min salloc). Pairs with reduced --timing on re-dispatch.
…_dtype (not in 0.6.13 PyPI)

The combine output_dtype is on flashinfer main but not the latest PyPI release (0.6.13), so pip -U
was insufficient (asserted 'combine still has no output_dtype'). Install from the nightly wheel
index https://flashinfer.ai/whl/nightly/ (flashinfer-python --no-deps + cubin + cu130 jit-cache),
which is built from main and has the quantized-combine output path.
…CuTe DSL coupling)

The nightly (main) flashinfer's CuTe-DSL kernels import newer cutlass.cute symbols
(OperandMajorMode) than the container's bundled nvidia-cutlass-dsl — import failed. Upgrade
nvidia-cutlass-dsl alongside the nightly flashinfer-python/cubin/jit-cache.
…ance

The flashinfer nightly upgrade happens AFTER env_capture, so the upgraded versions live nowhere
else. cx_build_flashinfer_latest now captures CX_FLASHINFER_STACK (flashinfer-python/cubin/jit-cache
+ nvidia-cutlass-dsl + torch versions) — logged to the GHA log AND read into ep_flashinfer's
backend_provenance (flashinfer_stack). Reproducibility for the quant-combine runs (which depend on a
specific newer flashinfer+cutlass-dsl set).
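
One way to capture such a stack is to read installed versions from package metadata. The sketch assumes the distribution names listed above and a JSON encoding; how `CX_FLASHINFER_STACK` is actually serialized is not stated in the commit.

```python
import json
from importlib import metadata

STACK_PACKAGES = (
    "flashinfer-python",
    "flashinfer-cubin",
    "flashinfer-jit-cache",
    "nvidia-cutlass-dsl",
    "torch",
)

def capture_stack(packages=STACK_PACKAGES):
    """Map each package to its installed version, or None when absent."""
    stack = {}
    for name in packages:
        try:
            stack[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            stack[name] = None
    return stack

# A launcher could export this string and the adapter could parse it back.
print(json.dumps(capture_stack(), sort_keys=True))
```

Recording `None` for a missing package keeps the provenance record complete on images that lack part of the stack.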